DOMAIN :- Telecom

Import and warehouse data

Task: Import all the given datasets and explore the shape and size of each.

Task: Merge all datasets into one and explore the final shape and size.
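The import and merge tasks above can be sketched with pandas as follows. This is only an illustration: the notes do not name the files or their columns, so the two small tables, the "customerID" key and the other column names are assumptions made up for the example.

```python
import pandas as pd

# Two tiny stand-in tables. The real datasets are not described in the notes,
# so every column name here (including the "customerID" key) is hypothetical.
customers = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "gender": ["Female", "Male", "Male"],
})
services = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

for name, frame in [("customers", customers), ("services", services)]:
    # shape is (rows, columns); size is rows * columns
    print(name, frame.shape, frame.size)

# validate="one_to_one" makes pandas raise if a key is duplicated on either side,
# which guards against the merge silently multiplying rows.
merged = customers.merge(services, on="customerID", how="inner", validate="one_to_one")
print("merged", merged.shape, merged.size)
```

Comparing the row count of the merged table with the row counts of the inputs is a quick check that no customers were lost or duplicated in the merge.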

Data cleansing

Task: Explore and if required correct the datatypes of each attribute

Check the attributes for null values and, if required, drop or impute them.
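A minimal sketch of the two cleansing steps above, assuming a numeric column that was read in as text. The column names and values are invented for illustration; the notes do not say which attributes needed correcting or whether values were dropped or imputed.

```python
import pandas as pd

# Hypothetical frame: "TotalCharges" holds numbers stored as text, with one blank entry.
df = pd.DataFrame({
    "tenure": [1, 34, 0, 45],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],
})
print(df.dtypes)

# errors="coerce" turns entries that cannot be parsed (the blank string) into NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
print(df.isnull().sum())

# Option 1: drop the rows with a missing value
dropped = df.dropna(subset=["TotalCharges"])

# Option 2: impute the missing value, here with the column median
imputed = df.fillna({"TotalCharges": df["TotalCharges"].median()})
print(imputed)
```

Dropping is simplest when only a handful of rows are affected; imputing keeps every customer in the data at the cost of introducing an estimated value.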

Converting Float to Categorical

Data analysis & visualisation:

Detailed statistical analysis on the data

The dataset is roughly balanced across genders, but Male has the higher count.

The dataset shows that the majority of the customers are young.

The dataset is close to evenly split on Partner, but customers without a partner have the higher count.

70% of the customers have no dependents.

The majority of them use Fiber optic as their internet service.

Most of them prefer paperless billing

Observations on the above

1. Gender distribution shows that the dataset features a relatively equal proportion of male and female customers. Almost half of the customers in our dataset are female whilst the other half are male.

2. Most of the customers in the dataset are younger people.

3. Not many customers seem to have dependents, whilst almost half of the customers have a partner.

4. Most of the customers seem to have phone service, and three quarters of them have opted for paperless billing.

Distribution of label encoded categorical variables:

1. Most of the customers have phone service, out of which almost half have multiple lines.

2. Three quarters of the customers have opted for internet service via Fiber optic and DSL connections, with almost half of the internet users subscribing to streaming TV and movies.

3. Customers who have availed of the Online Backup, Device Protection, Technical Support and Online Security features are a minority.

A preliminary look at the overall churn rate shows that around 74% of the customers are active. As shown in the chart above, this is an imbalanced classification problem.
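To see why the imbalance matters, consider a model that always predicts "active". The sketch below uses a made-up sample of 100 labels with the 74/26 split quoted above (the label names are assumptions) and shows that such a model reaches 74% accuracy without identifying a single churner.

```python
# 100 hypothetical labels with the class split from the text: 74% active, 26% churned
labels = ["active"] * 74 + ["churned"] * 26

# A trivial baseline that ignores all features and always predicts the majority class
predictions = ["active"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the churn class: the share of actual churners that were predicted as churned
churn_found = sum(p == "churned" and y == "churned" for p, y in zip(predictions, labels))
churn_recall = churn_found / labels.count("churned")

print(f"baseline accuracy: {accuracy:.0%}")
print(f"baseline churn recall: {churn_recall:.0%}")
```

Any accuracy figure reported later is therefore best read against this 74% baseline rather than against 50%.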

Bivariate analysis

Data Pre-Processing

On comparing the original dataset with the training dataset, we see that the two are similar.

Model training, testing and tuning:

Using entropy

Training accuracy is 99%, which suggests the model is overfitting.

The tree looks overfitted.

Reducing the tree
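The notes do not say how the tree was reduced. One common option in scikit-learn is to cap the depth with max_depth (cost-complexity pruning via ccp_alpha is another). The sketch below uses synthetic data from make_classification as a stand-in for the churn data, so the printed scores are illustrative only and are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (roughly 74% / 26% classes, some label noise)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.74, 0.26], flip_y=0.05, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Unrestricted tree: grows until every training leaf is pure
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
full_tree.fit(X_train, y_train)

# Reduced tree: depth capped at 4 (the value 4 is an arbitrary choice for this sketch)
small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
small_tree.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("reduced", small_tree)]:
    print(
        f"{name}: depth={tree.get_depth()} "
        f"train={tree.score(X_train, y_train):.3f} "
        f"test={tree.score(X_test, y_test):.3f}"
    )
```

A large gap between the train and test scores of the full tree is the overfitting signal described above; the depth cap trades some training accuracy for a smaller gap.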

We can see that Tenure, Monthly charges, Total charges, Internet services and Contract are highly important features for predicting churn.

The model accuracy comes to around 78%.

Using Gini

Comparing entropy and Gini in the decision tree (DT), the DT using Gini has the higher accuracy.
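For reference, entropy and Gini are two measures of how mixed the classes in a tree node are; in scikit-learn's DecisionTreeClassifier they are selected with criterion="entropy" and criterion="gini". The sketch below computes both from scratch for a hypothetical node with the 74/26 class split of the full dataset (the label names are assumptions).

```python
from collections import Counter
from math import log2


def class_shares(labels):
    """Return the share of each class among the labels in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * log2(p) for p in class_shares(labels))


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    return 1 - sum(p * p for p in class_shares(labels))


# Hypothetical node holding 74 active and 26 churned customers
node = ["active"] * 74 + ["churned"] * 26
print(f"entropy = {entropy(node):.4f}")
print(f"gini    = {gini(node):.4f}")
```

Both measures are 0 for a pure node and largest for an even split (1 bit for entropy, 0.5 for Gini with two classes), so they usually lead to similar trees; which one scores better on a given dataset is an empirical question.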

Using Random forest

Apply Adaboost Ensemble Algorithm for the same data and print the accuracy.

The AdaBoost ensemble performs better than the DT and RF models.

Apply Bagging Classifier Algorithm and print the accuracy.

Apply GradientBoost Classifier Algorithm for the same data and print the accuracy

Comparing all the above, Gradient Boosting has the highest accuracy on test data.
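A comparison like the one above can be run in a single loop. The sketch below is a hedged illustration on synthetic data from make_classification, with hyperparameters chosen only for the example; it does not reproduce the notebook's dataset, settings or scores, and the ranking it prints may differ from the one reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (roughly 74% / 26% classes)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.74, 0.26], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hyperparameters below are illustrative assumptions, not the notebook's settings
models = {
    "Decision tree (gini)": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from highest to lowest test accuracy
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {score:.3f}")
```

Fitting every model on the same split, as here, keeps the comparison fair; small differences between scores on a single split can still be due to chance.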

For this dataset, boosting models give the best results on test predictions, so a boosting model can be selected to make predictions.

From the above, we can conclude that Gradient Boosting is the best model of choice for the given dataset, as it has the highest accuracy scores among the models compared, giving the largest number of correct positive predictions while minimizing false negatives.
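Correct positive predictions and false negatives are read from the confusion matrix rather than from accuracy alone. The sketch below counts them for a small invented set of labels, where 1 is assumed to mean "churned"; the numbers are not results from the notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Invented example: 4 churners (1) and 6 active customers (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)      # share of actual churners that were caught
precision = tp / (tp + fp)   # share of predicted churners that really churned

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Reporting recall next to accuracy makes the claim about false negatives directly checkable for each model.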

Insights

  1. We made use of a customer churn dataset to build a machine learning classifier that predicts the propensity of any customer to churn in the months to come, with a reasonable accuracy score of 76% to 80%.

Suggestion: The dataset should also include Occupation and Secondary Network Provider so that plans can be compared.